Previously, we saw the basic recipe for applying a supervised machine learning model: choose a class of model, choose the model's hyperparameters, fit the model to the training data, and use the model to predict labels for new data.
The first two pieces of this, the choice of model and the choice of its hyperparameters, are perhaps the most important part of using these tools and techniques effectively.
The question that comes up is: how can we make an informed choice for these parameters?
We have touched on questions from this realm already, but here we will examine them in more detail. As a concrete example, we start by loading the Iris data:
In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
Next we choose a model and hyperparameters:
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
Then we train the model, and use it to predict labels for data we already know:
In [ ]:
model.fit(X, y)
y_model = model.predict(X)
Finally, we compute the fraction of correctly labeled points:
In [ ]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)
We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model!
But is this truly measuring the expected accuracy?
Have we really come upon a model that we expect to be correct 100% of the time?
The answer is no: we trained and evaluated the model on the same data. Moreover, a nearest-neighbor model simply memorizes the training data and labels new points by comparing them to those stored points, so it trivially gets every training point right. A fairer estimate comes from a holdout set: we hold back part of the data from training and use it only to evaluate the model, which we can do with Scikit-Learn's train_test_split utility:
In [ ]:
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, train_size=0.5)
# fit the model on one set of data
model.fit(X1, y1)
# evaluate the model on the second set of data
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)
The nearest-neighbor classifier is about 90% accurate on this hold-out set, which is more in line with our expectation.
The hold-out set is similar to unknown data, because the model has not "seen" it before.
But we have now lost a portion of our data to the model training: in the above case, half the dataset does not contribute to the training of the model at all!
This is not optimal, and can cause problems, especially if the initial set of training data is small.
One way to address this is to use cross-validation; that is, to do a sequence of fits where each subset of the data is used both as a training set and as a validation set.
In the simplest version, we do two validation trials, alternately using each half of the data as a holdout set. Using the split data from before, we could implement it like this:
In [ ]:
y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)
We could compute the mean of the two accuracy scores to get a better measure of the global model performance.
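For instance, a minimal sketch of that averaging, reusing the predictions from the cell above:
In [ ]:
import numpy as np
# average the accuracies of the two alternating holdout trials
two_fold_scores = [accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)]
np.mean(two_fold_scores)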
This particular form of cross-validation is a two-fold cross-validation.
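Equivalently, we could let Scikit-Learn generate the two folds for us; here is a small sketch using the KFold splitter (shuffle=True matters here because the iris samples are ordered by class, so unshuffled halves would each be missing one class entirely):
In [ ]:
from sklearn.model_selection import KFold
# let Scikit-Learn generate the two folds; shuffling ensures each training
# half contains examples of all three iris classes
kfold = KFold(n_splits=2, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kfold.split(X):
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
scores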
We could expand on this idea to use even more trials, and more folds; for example, five-fold cross-validation splits the data into five groups and uses each of them in turn to validate a model trained on the other four fifths.
This would be rather tedious to do by hand, so we can use Scikit-Learn's cross_val_score convenience routine to do it succinctly:
In [ ]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)
This gives us an even better idea of the performance of the algorithm.
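If we want a single summary number rather than five separate scores, we can simply take the mean (a small sketch reusing the same call):
In [ ]:
import numpy as np
# summarize five-fold cross-validation with its mean accuracy
np.mean(cross_val_score(model, X, y, cv=5))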
How could we take this idea to its extreme?
We could use a number of folds equal to the number of data points; that is, we train on all points but one in each trial.
This type of cross-validation is known as leave-one-out cross-validation, and it can be used as follows:
In [ ]:
from sklearn.model_selection import LeaveOneOut
import numpy as np
loo = LeaveOneOut()
loo.get_n_splits(X)  # the number of splits equals the number of samples (150)
scores = []
for train_index, test_index in loo.split(X):
    # train on all but one sample, then score the prediction for the held-out sample
    model.fit(X[train_index], y[train_index])
    scores.append(accuracy_score(y[test_index], model.predict(X[test_index])))
scores = np.array(scores)
scores
Because we have 150 samples, leave-one-out cross-validation yields scores for 150 trials, and each score indicates either a successful (1.0) or an unsuccessful (0.0) prediction. Taking the mean of these gives an estimate of the model's accuracy:
In [ ]:
scores.mean()
This gives us a good impression of the performance of our model. But there is also a problem; can you spot it?
Sometimes the results of model validation can seem counter-intuitive at first, for example when a more complex model performs worse than a simpler one, or when adding more training data does not improve the results.
Fundamentally, the question of "the best model" is about finding a sweet spot in the tradeoff between bias and variance. Consider the following figure, which presents two regression fits to the same dataset:
Consider what happens if we use these two models to predict the y-value for some new data.
In the following diagrams, the red/lighter points indicate data that is omitted from the training set:
It is clear that neither of these models is a particularly good fit to the data, but they fail in different ways: one is a high-bias model that under-fits the data, while the other is a high-variance model that over-fits it.
The score here is the $R^2$ score, or coefficient of determination, which measures how well a model performs relative to a simple mean of the target values.
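As a quick, self-contained illustration of how the $R^2$ score behaves, here is a toy example with made-up numbers (not the data shown in the figure):
In [ ]:
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([1.0, 2.0, 3.0, 4.0])
# predicting the mean of the targets gives R^2 = 0
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))
# a perfect prediction gives R^2 = 1
print(r2_score(y_true, y_true))
# predictions worse than the mean give a negative score
print(r2_score(y_true, np.zeros_like(y_true)))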
From the scores associated with these two models, we can make an observation that holds more generally: for a high-bias model, the performance on the validation set is similar to the performance on the training set, whereas for a high-variance model, the performance on the validation set is far worse than the performance on the training set.
If we imagine that we have some ability to tune the model complexity, we would expect the training score and validation score to behave as illustrated in the following figure.
The diagram shown here is often called a validation curve, and we see the following essential features: the training score is everywhere higher than the validation score; for very low model complexity (a high-bias model) the training data is under-fit, so the model is a poor predictor both for the training data and for previously unseen data; for very high model complexity (a high-variance model) the training data is over-fit, so the model predicts the training data well but fails for previously unseen data; and for some intermediate value the validation curve reaches a maximum, indicating a suitable trade-off between bias and variance.
Let's look at an example. We will use a polynomial regression model: this is a generalized linear model in which the degree of the polynomial is a tunable parameter. For example, a degree-1 polynomial fits a straight line to the data; for model parameters $a$ and $b$:
$$ y = ax + b $$
A degree-3 polynomial fits a cubic curve to the data; for model parameters $a, b, c, d$:
$$ y = ax^3 + bx^2 + cx + d $$
We can generalize this to any number of polynomial features.
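To make the notion of polynomial features concrete, here is a small sketch (using a toy two-sample array) of how the PolynomialFeatures transformer expands a single column $x$ into the columns $1, x, x^2, x^3$:
In [ ]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
x = np.array([[2.0], [3.0]])
# each row [x] becomes [1, x, x^2, x^3]
PolynomialFeatures(degree=3).fit_transform(x)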
In Scikit-Learn, we can implement this with a simple linear regression combined with the polynomial preprocessor.
We will use a pipeline to string these operations together:
In [ ]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
Now let's create some data to which we will fit our model:
In [ ]:
import numpy as np
def make_data(N, err=1.0, rseed=1):
    # randomly sample the data
    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 10 - 1. / (X.ravel() + 0.1)
    if err > 0:
        y += err * rng.randn(N)
    return X, y
X, y = make_data(40)
We can now visualize our data, along with polynomial fits of several degrees:
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # for beautiful plotting
X_test = np.linspace(-0.1, 1.1, 500)[:, None]
plt.scatter(X.ravel(), y, color='black') # plot data
axis = plt.axis()
# plot polynomial fits of several degrees
for degree in [1, 3, 5]:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test.ravel(), y_test, label='degree={0}'.format(degree))
plt.xlim(-0.1, 1.0)
plt.ylim(-2, 12)
plt.legend(loc='best');
A useful question to answer is this: what degree of polynomial provides a suitable trade-off between bias (under-fitting) and variance (over-fitting)?
We can make progress in this by visualizing the validation curve for this particular data and model; this can be done straightforwardly using Scikit-Learn's validation_curve convenience routine.
Given a model, a dataset, a parameter name, and a range to explore, this function automatically computes both the training score and the validation score across the range:
In [ ]:
from sklearn.model_selection import validation_curve
degree = np.arange(0, 21)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                           param_name='polynomialfeatures__degree',
                                           param_range=degree, cv=10)
plt.plot(degree, np.median(train_score, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
This shows precisely the behavior we expect: the training score is everywhere higher than the validation score, the training score improves monotonically with the polynomial degree, and the validation score reaches a maximum before dropping off as the model over-fits.
Which degree is optimal?
From the validation curve we can read off that the best trade-off between bias and variance is found around a third-degree polynomial; let's compute and plot this fit over the original data:
In [ ]:
plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = PolynomialRegression(3).fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);
The optimal model will generally also depend on the size of the training data. To see this, let's generate a new dataset with five times as many points:
In [ ]:
X2, y2 = make_data(200)
plt.scatter(X2.ravel(), y2);
We will duplicate the preceding code to plot the validation curve for this larger dataset; for reference let's over-plot the previous results as well:
In [ ]:
degree = np.arange(51)
train_score2, val_score2 = validation_curve(PolynomialRegression(), X2, y2,
                                             param_name='polynomialfeatures__degree',
                                             param_range=degree, cv=10)
plt.plot(degree, np.median(train_score2, 1), color='blue', label='training score')
plt.plot(degree, np.median(val_score2, 1), color='red', label='validation score')
plt.plot(degree[:train_score.shape[0]], np.median(train_score, 1), color='blue', alpha=0.3, linestyle='dashed')
plt.plot(degree[:train_score.shape[0]], np.median(val_score, 1), color='red', alpha=0.3, linestyle='dashed')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.xlabel('degree')
plt.ylabel('score');
From the validation curve it is clear that the larger dataset can support a much more complicated model: the peak of the validation curve shifts to a noticeably higher degree, and model complexities that badly over-fit the 40-point dataset no longer do so here.
Thus we see that the behavior of the validation curve has not one but two important inputs: the model complexity and the number of training points.
It is often useful to explore the behavior of the model as a function of the number of training points.
This can be done by using increasingly larger subsets of the data to fit our model.
Such a plot of the training and validation score as a function of the size of the training set is known as a learning curve.
The general behavior of learning curves is this: a model of a given complexity will over-fit a small dataset, so the training score will be relatively high while the validation score is relatively low; the same model will under-fit a large dataset, so the training score will decrease while the validation score increases; and a model will never, except by chance, give a better score to the validation set than to the training set, so the two curves should keep getting closer together but never cross.
With these features in mind, we would expect a learning curve to look qualitatively like the curves shown in the following figure. Here we use Scikit-Learn's learning_curve convenience routine to compute and plot learning curves for our data, comparing a degree-2 and a degree-9 polynomial model:
In [ ]:
from sklearn.model_selection import learning_curve
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
fig.subplots_adjust(left=0.0625, right=0.95, wspace=0.1)
for i, degree in enumerate([2, 9]):
    N, train_lc, val_lc = learning_curve(PolynomialRegression(degree),
                                         X, y, cv=10,
                                         train_sizes=np.linspace(0.3, 1, 25))
    ax[i].plot(N, np.mean(train_lc, 1), color='blue', label='training score')
    ax[i].plot(N, np.mean(val_lc, 1), color='red', label='validation score')
    ax[i].hlines(np.mean([train_lc[-1], val_lc[-1]]), N[0], N[-1],
                 color='gray', linestyle='dashed')
    ax[i].set_ylim(0, 1)
    ax[i].set_xlim(N[0], N[-1])
    ax[i].set_xlabel('training size')
    ax[i].set_ylabel('score')
    ax[i].set_title('degree = {0}'.format(degree), size=14)
    ax[i].legend(loc='best')
In particular, a learning curve shows whether gathering more training data is likely to help: once the curves have converged, adding more data will not appreciably improve the score, and only a more complex model will. Plotting a learning curve for your particular choice of model and dataset can thus help you to make this type of decision about how to move forward in improving your analysis.
What is the difference between a validation curve and a learning curve?
In practice, models generally have more than one knob to turn, so plots of validation and learning curves quickly become impractical to inspect for every combination of hyperparameters; instead, we would like to simply find the particular model that maximizes the validation score. Scikit-Learn provides automated tools to do this in the grid search module.
Here is an example of using grid search to find the optimal polynomial model.
We will explore a three-dimensional grid of model features: the polynomial degree, the flag telling us whether to fit the intercept, and the flag telling us whether to normalize the problem.
This can be set up using Scikit-Learn's GridSearchCV meta-estimator:
In [ ]:
from sklearn.model_selection import GridSearchCV
param_grid = {'polynomialfeatures__degree': np.arange(21),
'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}  # note: 'normalize' was removed from LinearRegression in scikit-learn 1.2; omit this entry on newer versions
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=10)
Calling the fit() method will fit the model at each grid point, keeping track of the scores along the way:
In [ ]:
grid.fit(X, y);
Now that this is fit, we can ask for the best parameters as follows:
In [ ]:
grid.best_params_
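If desired, we can also inspect the quality of this best model; best_score_ reports the mean cross-validated score of the best parameter combination (here the regressor's default $R^2$ score):
In [ ]:
# mean cross-validated score of the best parameter combination
grid.best_score_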
Finally, if we wish, we can use the best model and show the fit to our data using code from before:
In [ ]:
model = grid.best_estimator_
plt.scatter(X.ravel(), y)
lim = plt.axis()
y_test = model.fit(X, y).predict(X_test)
plt.plot(X_test.ravel(), y_test);
plt.axis(lim);